Regulating data storage based on popularity

ABSTRACT

Technology is disclosed for regulating data storage based on a popularity of data files (“the technology”). Various embodiments of the technology include maintaining a fixed durability level of data files stored in a storage system by regulating a number of copies of the data files stored in the storage system. One embodiment includes regulating the number of copies of a particular data file based on the popularity of the particular data file among various users using the storage system. The number of copies stored in the storage system is increased or decreased, including from/to zero, based on the popularity of the particular data file. The popularity is determined based on at least one of: a number of computing devices of various users having the particular data file; a latency, network bandwidth and/or availability associated with reading the particular data file from the computing devices; or an access pattern of the particular data file.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application claims the benefit of U.S. Provisional Patent Application No. 61/708,794, entitled “CLOUD COMPUTING INTEGRATED OPERATING SYSTEM”, filed on Oct. 2, 2012, which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

Several of the disclosed embodiments relate to data storage, and more particularly, to regulating a number of copies of a data file to be stored in a storage system based on a popularity of the data file.

BACKGROUND

Current storage services such as cloud storage services allow users to store various multi-media content such as music files, video files, images, documents, etc. in the cloud. In order to provide for recovery from data loss, the cloud storage services typically replicate the content and store various copies of the content at different storage systems, and possibly at different locations. This requires huge amounts of storage resources and other associated infrastructure and maintenance resources to maintain the data centers, which can result in increased costs. Further, some of the data files stored for various users can be identical. For example, a music file such as “Optimistic” by “Radiohead” is the same for any user storing that music file with the cloud storage service. The current cloud storage services store multiple copies of the same file, e.g., one for each user who uploaded the music file, which results in a significant amount of space being used for storing identical files. Accordingly, the current storage services are inefficient at least in terms of managing the available storage space.

SUMMARY

Technology is disclosed for regulating data storage based on a popularity of data files (“the technology”). Various embodiments of the technology provide for maintaining a fixed durability level of data files stored in a storage system by regulating a number of copies of the data files stored in the storage system. One such embodiment includes regulating the number of copies of a particular data file stored in the storage system based on a popularity of the particular data file among various users who use the storage system. The number of copies stored in the storage system is increased or decreased, including from/to zero copies, based on the popularity of the particular data file. Further, the storage system can store either a complete data file or a portion of the data file. Accordingly, the technology is applicable to either the complete data file or a portion of the data file.

In some embodiments, the popularity of the particular data file is determined by computing a popularity value for the particular data file. The popularity value of the particular data file can be determined based on a number of factors, including one or more of: (a) a number of computing devices associated with one or more of the users that contain the particular data file, (b) a latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, (c) a network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, (d) availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, (e) a number of the users requiring storage for the same data file at the storage system, or (f) an access pattern of the particular data file for a specific user or a subset of the users. In some embodiments, one or more of the above factors can be weighted relative to each other.

The popularity value can be determined in various units and using various mathematical equations. One example expression of a popularity value can include a percentage value, where a popularity value of 100% can indicate that all the users serviced by the storage system have a copy of the particular data file on all their computing devices, the particular data file can be fetched from any of the computing devices with a minimum latency, the particular data file is accessed frequently etc. On the other hand, a popularity value of 0% can indicate that none of the users have a copy of the particular data file or it is not possible to retrieve a copy within maximum accepted latency etc.

The number of copies stored in the storage system is increased or decreased, including from/to zero copies, based on the popularity value. For example, if the popularity value of a particular data file is 100%, the storage system may not store any copies of the particular data file since the particular data file is available at all the computing devices of the users and can be retrieved from any of the computing devices at any time. On the other hand, if the popularity value of a particular data file is 0% the storage system may store one or more copies of the particular data file since the particular data file is not available at any of the computing devices or cannot be retrieved within a maximum accepted latency etc. Generally, the higher the popularity of the data file, the lower the number of copies of the data file that need to be stored at the storage system. Further, various popularity value ranges and number of copies that can be stored for each of the ranges can be configured, e.g., by an entity such as an administrator of the storage server.

When a user requests a particular data file, a server determines whether the particular data file is available at the storage system. If the particular data file is available at the storage system, the server serves the request by fetching the file from the storage system. On the other hand, if the particular data file is not available at the storage system, the server serves the request by fetching the file from any of the other computing devices of the user and/or any of the computing devices of other users that contain the particular data file.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an environment where data storage regulation for maintaining a specified durability level for the data files at a storage system can be implemented.

FIG. 2 illustrates an example system that regulates a number of copies of data files stored at a storage system based on a popularity value of the corresponding data file, consistent with various embodiments of the disclosed technology.

FIG. 3 illustrates an example of a system for serving a particular data file from the storage system, consistent with various embodiments of the disclosed technology.

FIG. 4 illustrates a block diagram of a server that regulates the number of copies of the data files stored at the storage system based on the popularity values of the corresponding data files, consistent with various embodiments of the disclosed technology.

FIG. 5 illustrates a flow diagram for regulating data storage at a storage system based on a popularity value of data files, consistent with various embodiments of the disclosed technology.

FIG. 6 illustrates an example process for serving a particular data file from the storage system, consistent with various embodiments of the disclosed technology.

FIG. 7 is a block diagram of a computer system as may be used to implement features of some embodiments of the disclosed technology.

DETAILED DESCRIPTION

Technology is disclosed for regulating data storage based on a popularity of data files (“the technology”). Various embodiments of the technology provide for maintaining a fixed durability level of data files stored in a storage system by regulating a number of copies of the data files stored in the storage system. One such embodiment includes regulating the number of copies of a particular data file stored in the storage system based on the popularity of the particular data file among various users who use the storage system. The number of copies stored in the storage system is increased or decreased, including from/to zero copies, based on the popularity of the particular data file. Further, the storage system can store either a complete data file or a portion of the data file. Accordingly, the technology is applicable to either the complete data file or a portion of the data file.

In some embodiments, the popularity of the particular data file is determined by computing a popularity value for the particular data file. The popularity value of the particular data file can be determined based on a number of factors, including one or more of: (a) a number of computing devices associated with one or more of the users that contain the particular data file, (b) a latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, (c) a network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, (d) availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, (e) a number of the users requiring storage for the same data file at the storage system, or (f) an access pattern of the particular data file for a specific user or a subset of the users. In some embodiments, one or more of the above factors can be weighted relative to each other.

The popularity value can be determined in various units and using various mathematical equations. One example expression of a popularity value can include a percentage value, where a popularity value of 100% can indicate that all the users serviced by the storage system have a copy of the particular data file on all their computing devices, the particular data file can be fetched from any of the computing devices with a minimum latency, the particular data file is accessed frequently etc. On the other hand, a popularity value of 0% can indicate that none of the users have a copy of the particular data file or it is not possible to retrieve a copy within maximum accepted latency etc.

The number of copies stored in the storage system is increased or decreased, including from/to zero copies, based on the popularity value. For example, if the popularity value of a particular data file is 100%, the storage system may not store any copies of the particular data file since the particular data file is available at all the computing devices of the users and can be retrieved from any of the computing devices at any time. On the other hand, if the popularity value of a particular data file is 0%, the storage system may store one or more copies of the particular data file since the particular data file is not available at any of the computing devices or cannot be retrieved within a maximum accepted latency, etc. Generally, the higher the popularity of the data file, the lower the number of copies of the data file that need to be stored at the storage system. Further, various popularity value ranges and the number of copies that can be stored for each of the ranges can be configured, e.g., by an entity such as an administrator of the storage server.

When a user requests a particular data file, a server determines whether the particular data file is available at the storage system. If the particular data file is available at the storage system, the server serves the request by fetching the file from the storage system. On the other hand, if the particular data file is not available at the storage system, the server serves the request by fetching the file from any of the other computing devices of the user and/or any of the computing devices of other users that contain the particular data file.

Environment

FIG. 1 illustrates an environment where data storage regulation for maintaining a specified durability level for the data files at a storage system can be implemented. The system 100 includes a storage system 105 for storing data files received from computing devices 130-140 of users. The system 100 includes a cloud server 110 configured to handle communications between the computing devices 130-140 and the storage system 105. The communications can include data storage or retrieval requests from the computing devices 130-140. In one embodiment, the cloud server 110 can be a server cluster having computer nodes interconnected with each other by a network. The server cluster can communicate with the storage system 105 via the Internet or other communication networks. The storage system 105 contains storage nodes 112. Each of the storage nodes 112 contains one or more processors 114 and storage devices 116. The storage devices 116 can include optical disk storage, RAM, ROM, EEPROM, flash memory, phase change memory, magnetic cassettes, magnetic tapes, magnetic disk storage or any other computer storage medium which can be used to store the desired information.

A cloud data interface 120 can also be included to receive data from and send data to computing devices 130-140. The cloud data interface 120 can include network communication hardware and network connection logic to receive the information from computing devices. The network can be a local area network (LAN), wide area network (WAN) or the Internet. The cloud data interface 120 may include a queuing mechanism to organize data updates received from or sent to the computing devices 130-140.

Although FIG. 1 illustrates two computing devices 130-140, a person having ordinary skill in the art will readily understand that the technology disclosed herein can be applied to a single computing device or more than two computing devices connected to the cloud server 110.

The computing devices 130-140 include an operating system 132-142 to manage the hardware resources of the computing devices 130-140 and provide services for running computer applications 134-144 (e.g., mobile applications running on mobile devices). The operating system 132-142 facilitates execution of the computer applications 134-144 on the computing device 130-140. The computing devices 130-140 include at least one local storage device 138-148 to store the computer applications 134-144 and user data. The computing device 130 or 140 can be a desktop computer, a laptop computer, a tablet computer, an automobile computer, a game console, a smartphone, a personal digital assistant, or other computing devices capable of running computer applications, as contemplated by a person having ordinary skill in the art.

The computer applications 134-144 stored in the computing devices 130-140 can include applications for general productivity and information retrieval, including email, calendar, contacts, and stock market and weather information. The computer applications 134-144 can also include applications in other categories, such as mobile games, factory automation, GPS and location-based services, banking, order-tracking, ticket purchases or any other categories as contemplated by a person having ordinary skill in the art.

The operating system 132-142 of the computing devices 130-140 includes socket redirection modules 136-146 to redirect network messages. The computer applications 134-144 generate and maintain network connections directed to various remote servers (not illustrated). The remote servers can include applications, products or services such as social networking applications that the users may interact with via the computer applications 134-144. Instead of directly opening and maintaining the network connections with these remote servers, the socket redirection modules 136-146 route all of the network messages for these connections of the computer applications 134-144 to the cloud server 110. The cloud server 110 is responsible for opening and maintaining network connections with the remote servers.

All or some of the network connections of the computing devices 130-140 are through the cloud server 110. The network connections can include Transmission Control Protocol (TCP) connections, User Datagram Protocol (UDP) connections, or other types of network connections based on other protocols. When there are multiple computer applications 134-144 that need network connections to multiple remote servers, the computing devices 130-140 only need to maintain one network connection with the cloud server 110. The cloud server 110 will in turn maintain multiple connections with the remote servers on behalf of the computer applications 134-144.

In various embodiments, the cloud server 110 maintains a certain level of durability of the data files stored at the storage system 105 by regulating a number of copies of the data files stored at the storage system 105. In some embodiments, the cloud server 110 regulates the number of copies of the data files based on the popularity values of the data files. For example, the more popular the data files are among the users, the fewer the number of copies of the data files stored at the storage system. Additional details with respect to regulating number of copies of the data files based on the popularity values are described at least with reference to FIGS. 2-7.

FIG. 2 illustrates an example system that regulates a number of copies of data files stored at a storage system based on a popularity value of the corresponding data file, consistent with various embodiments. In some embodiments, the system 200 can be similar to a system such as system 100 of FIG. 1. In some embodiments, the server 230 is similar to the cloud server 110 and the storage system 235 can be similar to storage system 105. In the figure, the storage system 235 a is a non-regulated data storage, and storage systems 235 b-d are examples of regulated storage systems in which the number of copies of data files is regulated based on the popularity of the data files. In an embodiment, the storage systems 235 a-d form a storage system 235 of the system 200.

The server 230 provides data storage services to a number of users, including a first user, a second user and a third user, to store various data files. The data files can include files such as images, videos, logs, application configuration files, computing device configuration files, etc. A user can upload data files from one or more computing devices associated with the user to the server 230 via a communication network 225. For example, a first user can upload a data file, “File A,” from a first computing device 205 and a second computing device 210. Similarly, the third computing device 215 uploads data files “File A” and “File B,” and the fourth computing device 220 uploads “File A,” “File B” and “File C.” Accordingly, the server 230 stores four copies of “File A,” two copies of “File B” and one copy of “File C” in the storage system 235. The storage system 235 a can have a number of storage units across which the data files can be stored. Further, in some embodiments, the storage units can be spread across various geographical locations.

Typically, a storage system keeps a number of copies of the data files in order to improve the durability of data files, e.g., to minimize the impact of data loss either at the user end or at the storage system end. In some embodiments, the server 230 maintains a certain level of durability of the data files stored at the storage system 235 a by regulating a number of copies of the data files stored at the storage system 235 a based on the popularity of the data files. The more popular the data files are among the users, the fewer copies of the data files are stored at the storage system.

The popularity of a particular data file is measured using a popularity value. In some embodiments, the popularity value of the particular data file is determined based on a number of factors, including one or more of: (a) a number of computing devices associated with one or more of the users that contain the particular data file, (b) a latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, (c) a network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, (d) availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, (e) a number of the users requiring storage for the same data file at the storage system, or (f) an access pattern of the particular data file for a specific user or a subset of the users. In some embodiments, one or more of the above factors can be weighted relative to each other and an overall popularity value of the particular data file can be determined as a function of the popularity value for one or more of the above factors.
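By way of illustration only, the weighting of factors (a)-(f) can be sketched as a weighted average. The sketch below assumes that each factor has already been reduced to a score between 0.0 and 1.0, where a higher score always means "more popular"; the class name, field names and weights are hypothetical and are not specified by the disclosure, which leaves the units and the combining function open.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FactorScores:
    """Hypothetical per-factor scores, each assumed normalized to 0.0-1.0."""
    device_count: float    # factor (a): more devices holding the file -> higher
    latency: float         # factor (b): lower read latency -> higher
    bandwidth: float       # factor (c): more bandwidth -> higher
    availability: float    # factor (d): more reliable connection -> higher
    user_count: float      # factor (e): more users storing the file -> higher
    access_pattern: float  # factor (f): less frequent access -> higher


# Assumed example weights; an administrator could choose any non-negative values.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "device_count": 3,
    "latency": 2,
    "bandwidth": 2,
    "availability": 2,
    "user_count": 1,
    "access_pattern": 1,
}


def overall_popularity(scores: FactorScores,
                       weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Return an overall popularity value as a percentage (0-100)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = 0.0
    for name, weight in weights.items():
        score = getattr(scores, name)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name} must be within 0.0-1.0")
        weighted_sum += weight * score
    return 100.0 * weighted_sum / total_weight
```

A weighted average is only one possible combining function; any other function of the per-factor values would be equally consistent with the description above.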

In some embodiments, the higher the number of computing devices that contain the particular data file, the higher is the popularity value of the particular data file. This may indicate that since the particular data file is available from many computing devices, a lesser number of copies, including zero, may be stored at the storage system 235. When a user requests to retrieve the particular data file, the server obtains the particular data file from one of the computing devices and serves the particular data file to the user.

In some embodiments, the higher the latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, the lower the popularity value of the particular data file. In some embodiments, if the latency is above a maximum acceptable value, the server may determine to store a higher number of copies at the storage system 235. In some embodiments, an overall latency-based popularity value may be determined as an average, or as any other function, of the latency-based popularity values of the particular data file for each of the computing devices that contain the particular data file.
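As a minimal sketch of the averaging described above, the following assumes a linear mapping from measured latency to a per-device score between 0.0 and 1.0, with any latency at or above the maximum acceptable value scoring zero. The function names, the linear mapping and the millisecond unit are assumptions made for illustration.

```python
from typing import Sequence


def latency_score(latency_ms: float, max_acceptable_ms: float) -> float:
    """Map one device's read latency to a score in 0.0-1.0 (assumed linear)."""
    if max_acceptable_ms <= 0:
        raise ValueError("max_acceptable_ms must be positive")
    if latency_ms < 0:
        raise ValueError("latency_ms must be non-negative")
    if latency_ms >= max_acceptable_ms:
        # The file cannot be read from this device within the accepted latency.
        return 0.0
    return 1.0 - latency_ms / max_acceptable_ms


def overall_latency_score(latencies_ms: Sequence[float],
                          max_acceptable_ms: float) -> float:
    """Average the per-device scores over all devices that hold the file."""
    if not latencies_ms:
        # No device holds the file, so it cannot be fetched from any device.
        return 0.0
    scores = [latency_score(ms, max_acceptable_ms) for ms in latencies_ms]
    return sum(scores) / len(scores)
```

The same shape could be reused for the bandwidth-based and availability-based values, with the mapping inverted so that more bandwidth or availability yields a higher score.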

In some embodiments, the higher the network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, the higher the popularity value of the particular data file. In some embodiments, an overall network bandwidth-based popularity value may be determined as an average, or as any other function, of the network bandwidth-based popularity values of the particular data file for each of the computing devices that contain the particular data file.

In some embodiments, the higher the availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, the higher the popularity value of the particular data file. In some embodiments, an overall network connection availability-based popularity value may be determined as an average, or as any other function, of the network connection availability-based popularity values of the particular data file for each of the computing devices that contain the particular data file.

In some embodiments, the higher the number of the users requiring storage for the same data file at the storage system, the higher the popularity value of the particular data file.

In some embodiments, the access pattern of the particular data file is considered for determining the popularity value. The access pattern can be based on how frequently the particular data file stored at the storage system 235 a is accessed or requested by a user who has uploaded the particular data file. The higher the frequency of access, the higher the number of copies stored at the storage system. If the frequency of access is high, the server 230 may determine to store one or more copies on the storage system since it may be faster and more efficient to retrieve the data file from the storage system rather than the computing devices of the users that contain the copy of the particular data file. Accordingly, the higher the frequency of access the lower the popularity value. Further, in some embodiments, the access pattern of the particular data file may be considered not only for a particular user but also for a subset of the users.

The popularity value can be determined in various units and using various mathematical equations. One example expression of a popularity value can include a percentage value, where a popularity value of 100% can indicate that all the users serviced by the storage system have a copy of the particular data file on all their computing devices, the particular data file can be fetched from any of the computing devices with a minimum latency, the particular data file is accessed frequently, etc. On the other hand, a popularity value of 0% can indicate that none of the users have a copy of the particular data file or it is not possible to retrieve a copy within a maximum accepted latency, etc.

The number of copies stored in the storage system is increased or decreased, including from/to zero, based on the popularity value. For example, if the popularity value of a particular data file is 100% the storage system may not store any copies of the particular data file since the particular data file is available at all the computing devices of the users and can be retrieved from any of the computing devices at any time. On the other hand, if the popularity value of a particular data file is 0% the storage system may store one or more copies of the particular data file since the particular data file is not available at any of the computing devices or cannot be retrieved within a maximum accepted latency etc. Generally, the higher the popularity, the lower the number of copies of the data file stored at the storage system. Further, various popularity value ranges and number of copies that can be stored for each of the ranges can be configured, e.g., by an entity such as an administrator of the storage server.
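The configurable popularity value ranges can be sketched, for illustration, as a lookup table that maps a popularity percentage to a target number of copies. The thresholds and copy counts below are hypothetical administrator settings, not values taken from the disclosure; the only property relied on is that higher popularity never maps to more copies.

```python
from typing import List, Tuple

# Hypothetical policy: (minimum popularity %, copies to keep), sorted from the
# highest threshold to the lowest. An administrator could configure any values.
COPY_POLICY: List[Tuple[float, int]] = [
    (90.0, 0),  # very popular: readily available from user devices
    (60.0, 1),
    (30.0, 2),
    (0.0, 3),   # unpopular: the storage system is the main source
]


def target_copies(popularity_pct: float,
                  policy: List[Tuple[float, int]] = COPY_POLICY) -> int:
    """Return the number of copies to keep for a given popularity value."""
    if not 0.0 <= popularity_pct <= 100.0:
        raise ValueError("popularity_pct must be within 0-100")
    for min_pct, copies in policy:
        if popularity_pct >= min_pct:
            return copies
    raise ValueError("policy does not cover this popularity value")


def copy_adjustment(current_copies: int, popularity_pct: float) -> int:
    """Positive result: copies to add. Negative result: copies to remove."""
    return target_copies(popularity_pct) - current_copies
```

For example, under this assumed policy a file with four stored copies and a popularity value of 95% would have all four copies removed, while a file with one stored copy and a popularity value of 40% would gain one copy.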

Referring back to the non-regulated storage system 235 a, the storage system 235 a includes four copies of “File A,” two copies of “File B” and a copy of “File C.” The server 230 may adjust the number of copies of the above mentioned data files in one or more of the following ways:

Regarding “File A,” the server 230 may determine that “File A” has a high popularity value, e.g., because each of the four computing devices has a copy of “File A”, the availability of network connection with one or more of the computing devices is high, etc. Accordingly, the server 230 may decrease the number of copies of “File A” by half as shown in regulated storage systems 235 b-c. In some embodiments, the server 230 may even determine not to store any copy of “File A” in the storage system as shown by example storage system 235 d.

Regarding “File B,” the server 230 may determine to retain the same number of copies based on the popularity value of “File B.” Regarding “File C,” in some embodiments, the popularity value may indicate that one copy of “File C” is sufficient to be stored at the storage system, e.g., because only one computing device needs the file, the file is not accessed as frequently, etc. Accordingly, the server 230 stores only one copy of “File C” as shown in the example storage system 235 b. However, in some embodiments, the popularity value of “File C” may change even with just one user, e.g., if the user is travelling, the network connectivity between the fourth computing device 220 and the storage unit in the storage system 235 a that contains the copy of “File C” may change when the user is at another geographical location. The popularity value of “File C” can change and therefore can have an effect on the number of copies stored at the storage system. The popularity value may indicate that two copies of the file be maintained at the storage system. Accordingly, the server 230 may add another copy of “File C” at the storage system as shown in regulated storage systems 235 c-d. In some embodiments, the server 230 may add another copy of “File C” in the storage unit of storage systems 235 c-d that is closer to the location to which the user has travelled.

In some embodiments, the server 230 determines whether various data files uploaded by different users are similar by using various file comparison techniques such as checksum, hash sum etc. The server 230 generates a checksum for each of the files uploaded to the server 230 for further storage at storage system 235 and stores the checksum of each of the data files in the storage system 235 or in another storage system separate from the storage system 235. The checksums may be calculated for a portion of the data file, e.g., a block of a file or a segment of file that has a plurality of blocks, or a complete data file. Further, the server 230 also stores the identifications of at least one of the user and the computing device which uploaded a particular data file. In some embodiments, the checksums and the identifications of the users and/or computing devices are stored in a data file availability table (not illustrated). The server 230 may use the data file availability table in determining the popularity value and also in determining which of the computing devices has a particular data file.
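A data file availability table of the kind described above can be sketched as a mapping from a checksum to the users and computing devices that uploaded matching content. The sketch assumes SHA-256 as the checksum and an in-memory dictionary as the table; the disclosure does not name a particular checksum algorithm or storage layout, and the class and method names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Set, Tuple


class AvailabilityTable:
    """Hypothetical in-memory table: checksum -> {(user_id, device_id), ...}."""

    def __init__(self) -> None:
        self._holders: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def record_upload(self, data: bytes, user_id: str, device_id: str) -> str:
        """Checksum the uploaded bytes (a whole file or one block) and record
        which user and device uploaded them. Returns the checksum."""
        checksum = hashlib.sha256(data).hexdigest()
        self._holders[checksum].add((user_id, device_id))
        return checksum

    def devices_with(self, checksum: str) -> Set[str]:
        """Return the devices known to hold content with this checksum."""
        return {device for _, device in self._holders.get(checksum, set())}

    def user_count(self, checksum: str) -> int:
        """Return how many distinct users uploaded content with this checksum."""
        return len({user for user, _ in self._holders.get(checksum, set())})
```

Because identical content always produces the same checksum, uploads of the same file from different users collapse onto one table entry, which gives the server the device count and user count used in determining the popularity value.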

In some embodiments, the server 230 can use various storage techniques to store data efficiently. One example storage technique can include compression of data files that compresses the data files so that the space consumed by the data file is minimized. The computing devices can include devices such as a smart phone, a digital media player, a laptop, a desktop, a tablet PC etc.

FIG. 3 illustrates an example of a system 300 for serving a particular data file from the storage system, consistent with various embodiments. In some embodiments, the system 300 can be similar to the system 200 of FIG. 2, the server 330 can be similar to the server 230, the computing devices 305-320 can be similar to the computing devices 205-220, respectively, and the storage system 350 can be similar to the storage system 235 d. The users associated with the first computing device 305, the second computing device 310, the third computing device 315 and the fourth computing device 320 have uploaded one or more of data files “File A,” “File B” and “File C” to the server 330 for storage as illustrated with reference to FIG. 2.

The server 330 has adjusted the number of copies of the data files stored at storage system 350. For example, while the server 330 has stored two copies each of “File B” and “File C,” no copies of “File A” are stored at the storage system 350, e.g., because “File A” has a high popularity value due to being available from a number of computing devices.

A computing device such as the third computing device 315 requests the server 330 to retrieve “File A” that it had uploaded earlier. The server 330 determines whether the storage system 350 has a copy of “File A.” If the storage system 350 has a copy of “File A,” then the server 330 obtains the data file from the storage system 350 and serves the data file to the third computing device 315. On the other hand, if the storage system 350 does not have a copy of “File A,” the server 330 determines which of the computing devices has a copy of “File A.” In some embodiments, the availability table 325 includes data specifying which of the computing devices has which of the data files and also data specifying other attributes such as network bandwidth for the computing devices, their network connection availability, associated latency to obtain the data file, etc.

The server 330 checks the availability table 325 to determine which of the computing devices has a copy of “File A” and identifies a particular computing device from which it can retrieve a copy of “File A.” In some embodiments, the server 330 may select a computing device, e.g., the first computing device 305, from which the copy of “File A” can be retrieved with the least latency. The server 330 retrieves the copy of “File A” from the first computing device 305 and serves the data file, “File A,” to the third computing device 315. The third computing device 315 would not be aware of where the data file is retrieved from. From the perspective of the third computing device 315, the data file, “File A,” is retrieved from the storage system 350.
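
By way of a non-limiting illustration, the selection of a source computing device from the availability table may be sketched in Python as follows. The record layout, the field names, the device identifiers used as strings and the sample values are hypothetical and are chosen only for this example; the least-latency criterion follows the example given above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AvailabilityEntry:
    device_id: str         # hypothetical identifier, e.g. "305"
    file_name: str         # data file held by the device
    online: bool           # network connection availability
    latency_ms: float      # latency to read the file from the device
    bandwidth_mbps: float  # network bandwidth available for the read


def select_source_device(
    table: List[AvailabilityEntry], file_name: str
) -> Optional[AvailabilityEntry]:
    """Pick the reachable device holding the file with the least latency."""
    candidates = [e for e in table if e.file_name == file_name and e.online]
    if not candidates:
        return None  # no computing device can currently supply the file
    return min(candidates, key=lambda e: e.latency_ms)


# Sample table contents (assumed values).
availability_table = [
    AvailabilityEntry("305", "File A", True, 20.0, 50.0),
    AvailabilityEntry("310", "File A", True, 85.0, 10.0),
    AvailabilityEntry("320", "File A", False, 5.0, 100.0),  # offline
    AvailabilityEntry("310", "File B", True, 85.0, 10.0),
]
chosen = select_source_device(availability_table, "File A")
```

With the assumed values, the device identified as "305" is chosen because the device with the lower latency is not reachable. Other criteria, such as available bandwidth, could be substituted in the key function.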

FIG. 4 illustrates a block diagram of a server 400 that regulates the number of copies of the data files stored at the storage system based on the popularity values of the corresponding data files, consistent with various embodiments of the disclosed technique. In some embodiments, the server 400 can be similar to cloud server 110 of FIG. 1. The server 400 can be, e.g., a dedicated standalone server, or implemented in a cloud computing service. The server 400 includes a network component 410, a processor 420, a memory 430, a request receiving module 440, a popularity value determination module 450, a data file replication management module 460 and a data file serving module 470. The memory 430 can include instructions which, when executed by the processor 420, enable the server 400 to perform the functions as described with reference to cloud server 110. The network component 410 is configured for network communications with computing devices and remote servers (not illustrated). The network component 410 establishes a device network connection with a computing device, and a server network connection with the storage system 105 in response to a request from the computing device for connecting with the storage system 105. The request can be generated by a computer application running at the computing device.

As explained above, the server 400 facilitates storing of data files of the users at a storage system such as storage system 105. The data files can be received from one or more users and also from one or more computing devices of each of the users. For example, a user can be associated with multiple computing devices such as smartphones, digital media players, laptops, desktops, tablet PCs etc. The data files can include files such as images, videos, logs, application configuration files, computing device configuration files etc. The server 400 maintains a certain level of durability of the data files stored at the storage system 105 by regulating a number of copies of the data files stored at the storage system 105. In some embodiments, the more popular the data files are among the users, the lower the number of copies of the data files stored at the storage system 105.

The popularity of a particular data file is measured using a popularity value. The popularity value determination module 450 determines the popularity of the data file based on a number of factors, including one or more of: (a) a number of computing devices associated with one or more of the users that contain the particular data file, (b) a latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, (c) a network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, (d) availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, (e) a number of the users requiring storage for the same data file at the storage system, or (f) an access pattern of the particular data file for a specific user or a subset of the users. In some embodiments, one or more of the above factors can be weighted relative to each other. The popularity value can be determined in various units and using various mathematical equations. For example, the popularity value can be expressed as a percentage.
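
By way of a non-limiting illustration, one possible way of weighting the factors relative to each other and expressing the result as a percentage may be sketched in Python as follows. The weighted-average formula, the normalization of each factor to the range 0 to 1, the factor names and the sample weights are all assumptions made for this example; the disclosure leaves the actual equation open.

```python
from typing import Dict


def popularity_value(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine normalized factor scores (each in [0, 1]) into a percentage.

    Assumption: the popularity value is a weighted average of the scores.
    """
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} must be within [0, 1]")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(scores[name] * weights[name] for name in scores)
    return 100.0 * weighted / total_weight


# Hypothetical normalized scores for factors (a) through (d).
scores = {
    "device_share": 0.8,  # (a) fraction of devices holding the file
    "latency": 0.5,       # (b) 1.0 means the lowest latency
    "bandwidth": 0.5,     # (c) 1.0 means the highest bandwidth
    "availability": 1.0,  # (d) fraction of time the devices are reachable
}
weights = {"device_share": 2.0, "latency": 1.0, "bandwidth": 1.0, "availability": 1.0}
value = popularity_value(scores, weights)
```

With these assumed inputs the popularity value evaluates to about 72%. Doubling the weight of the device count relative to the other factors is an arbitrary choice made for the example.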

The data file replication management module 460 determines the number of copies to be maintained at the storage system 105 for a particular data file. Generally, the higher the popularity of the data file, the lower the number of copies of the data file stored at the storage system 105. Further, various popularity value ranges and the number of copies that can be stored for each of the ranges can be configured, e.g., by an entity such as an administrator of the storage server. Further, in some embodiments, the data file replication management module 460 also maintains an availability table that includes data specifying which of the computing devices has copies of which of the data files, and also includes data specifying other attributes such as network bandwidth for the computing devices, their network connection availability, associated latency to obtain the data file, etc.

The request receiving module 440 receives requests from the users for storing or retrieving data files at/from the storage system 105. In some embodiments, the request receiving module 440 receives the request via the network component 410 that facilitates communication with the computing devices of the users.

Data file serving module 470 responds to the requests from a user for retrieving the data files from storage system 105 by retrieving the data file and serving it to the user. The data file serving module 470 serves the data file by either retrieving the data file from the storage system 105 or from one of the computing devices if the storage system does not have the requested data file. In some embodiments, the data file serving module 470 checks with the availability table to determine which of the computing devices has a copy of the requested data file, and retrieves the copy of data file from one of the identified computing devices.

FIG. 5 illustrates a flow diagram for regulating data storage at a storage system based on a popularity value of data files, consistent with various embodiments. The process 500 may be executed in a system such as system 100 of FIG. 1. At step 505, the server 110 receives a request to store a data file from one or more users. In some embodiments, if more than one user uploads the same data file, multiple copies of the data file are created. At step 510, the server 110 stores the multiple copies of the data file at the storage system 105.

At step 515, the server 110 determines a popularity of the data file. In some embodiments, the popularity of a data file is measured using a popularity value. The popularity value of the data file is determined based on a number of factors, including one or more of: (a) a number of computing devices associated with one or more of the users that contain the particular data file, (b) a latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, (c) a network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, (d) availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, (e) a number of the users requiring storage for the same data file at the storage system, or (f) an access pattern of the particular data file for a specific user or a subset of the users. In some embodiments, one or more of the above factors can be weighted relative to each other. The popularity value can be determined in various units and using various mathematical equations. For example, the popularity value can be expressed as a percentage.

At step 520, the server 110 determines a number of copies of the data file to be stored at the storage system 105 based on the popularity value. In some embodiments, the higher the popularity of the data file, the lower the number of copies of the data file stored at the storage system. For example, if the popularity value of a particular data file is 100%, the storage system may not store any copies of the particular data file since the particular data file is available at all the computing devices of the users and can be retrieved from any of the computing devices at any time. On the other hand, if the popularity value of a particular data file is 0%, the storage system may store one or more copies of the particular data file since the particular data file is not available at any of the computing devices or cannot be retrieved within a maximum accepted latency, etc. Further, various popularity value ranges and the number of copies that can be stored for each of the ranges can be configured, e.g., by an entity such as an administrator of the storage server. For example, a popularity range of 0-9% may have 5 copies, 10-40% may have 4 copies, 41-70% may have 3 copies, 71-95% may have 2 copies and 96-100% may have 0 (zero) copies.
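
By way of a non-limiting illustration, the example ranges given above may be expressed as a lookup in Python as follows. The range boundaries and copy counts are taken from the example above; the function name and the handling of fractional values (a fractional value belongs to the range whose lower bound it has reached) are assumptions made for this sketch.

```python
from bisect import bisect_right

# Lower bound of each popularity range (percent) and the copies kept for it.
LOWER_BOUNDS = [0, 10, 41, 71, 96]
COPIES = [5, 4, 3, 2, 0]


def copies_for_popularity(popularity_percent: float) -> int:
    """Map a popularity value in [0, 100] to a configured number of copies."""
    if not 0 <= popularity_percent <= 100:
        raise ValueError("popularity value must be between 0 and 100")
    # bisect_right returns how many lower bounds are <= the value, so the
    # matching range is the one just before that position.
    return COPIES[bisect_right(LOWER_BOUNDS, popularity_percent) - 1]
```

Under this sketch, a popularity value of 85% maps to two copies and a value of 98% maps to zero copies. An administrator could reconfigure the two lists without changing the lookup itself.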

At step 525, the server 110 regulates or adjusts the number of copies of the data file at the storage system by at least one of: (a) not storing any copy of the data file at the storage system if the popularity value exceeds a first threshold, (b) increasing the number of copies stored at the storage system if the popularity value is below a second threshold, or (c) decreasing the number of copies stored at the storage system if the popularity value exceeds a third threshold. In some embodiments, the number of copies of a data file can be regulated either for a complete data file or for a portion of the data file.
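
By way of a non-limiting illustration, the threshold-based adjustment of step 525 may be sketched in Python as follows. The threshold values, the step size of one copy, the upper limit on copies and the rule that a decrease under option (c) keeps at least one copy are all assumptions made for this example; the disclosure does not fix them.

```python
def adjust_copy_count(
    current_copies: int,
    popularity_percent: float,
    first_threshold: float = 95.0,   # above this, store no copies (a)
    second_threshold: float = 10.0,  # below this, add a copy (b)
    third_threshold: float = 70.0,   # above this, remove a copy (c)
    max_copies: int = 5,
) -> int:
    """Return the new number of copies to keep at the storage system."""
    if popularity_percent > first_threshold:
        return 0
    if popularity_percent < second_threshold:
        return min(current_copies + 1, max_copies)
    if popularity_percent > third_threshold:
        # Assumption: only rule (a) may drop the count to zero.
        return max(current_copies - 1, 1)
    return current_copies
```

For example, with the assumed thresholds, a data file held in three copies would drop to two copies at a popularity value of 80% and to zero copies at a popularity value of 99%.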

FIG. 6 illustrates an example process for serving a particular data file from the storage system, consistent with various embodiments. In an embodiment, the process 600 may be implemented in a system such as system 100 of FIG. 1. At step 605, the server 110 receives a request from a user to retrieve a data file of the user from a storage system. In some embodiments, the user can have one or more computing devices associated with the user. The user may issue the request from any of the computing devices. At step 610, the server 110 determines whether the storage system 105 has an entire copy of the requested data file. Responsive to a determination that the storage system 105 has the entire copy of the requested data file, at step 645, the server 110 serves the copy of the requested data file to the user from the storage system 105.

On the other hand, responsive to a determination that the storage system 105 does not have a copy of the entire data file, at step 615, the server 110 determines whether the storage system 105 has a portion of the requested data file. Responsive to a determination that the storage system has a portion of the requested file, e.g., a first block or segment etc., at step 620, the server 110 determines which of the computing devices of other users have a copy of the remaining portions of the requested data file. In some embodiments, the server checks with the availability table to determine which of the computing devices has a copy of the requested data file.

In some embodiments, the server 110 generates a checksum for each of the data files uploaded by the users to the server 110 for storage of the data files. The checksums may be calculated for a portion of the data file, e.g., a block of a file or a segment of file that has a plurality of blocks, or a complete data file. The server 110 stores the checksums of the data files in the availability table. In some embodiments, the server 110 also stores other attributes such as the names of the data file, identifications of the computing devices from which the data files are uploaded, a network bandwidth available for reading the copy of files from the corresponding computing devices, a network availability for connecting with the corresponding computing devices, associated latency etc. Some of the foregoing attributes may be updated periodically.

In some embodiments, the server 110 compares a checksum of the requested data file with the stored checksums of the data files to determine if any of the computing devices has the copy of the requested data file. The server 110 chooses, based on a predefined criterion, one of the computing devices from which to retrieve a copy of the requested data file. For example, the server 110 can choose a computing device from which the copy of the data file can be read with the least latency.
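
By way of a non-limiting illustration, the checksum bookkeeping and comparison described above may be sketched in Python as follows. The use of SHA-256 from the standard-library hashlib module, the class and method names, the block size and the device identifiers are assumptions made for this example; the disclosure does not name a checksum algorithm.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set


def checksum(data: bytes) -> str:
    """Checksum of a complete data file (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def block_checksums(data: bytes, block_size: int = 4096) -> List[str]:
    """Checksums of consecutive fixed-size blocks of a data file."""
    return [checksum(data[i:i + block_size]) for i in range(0, len(data), block_size)]


class ChecksumIndex:
    """Maps each checksum to the computing devices that uploaded the file."""

    def __init__(self) -> None:
        self._devices: Dict[str, Set[str]] = defaultdict(set)

    def record_upload(self, device_id: str, data: bytes) -> str:
        digest = checksum(data)
        self._devices[digest].add(device_id)
        return digest

    def devices_with(self, digest: str) -> Set[str]:
        # Return a copy so callers cannot mutate the index.
        return set(self._devices.get(digest, set()))


index = ChecksumIndex()
index.record_upload("305", b"identical song bytes")
index.record_upload("310", b"identical song bytes")
index.record_upload("315", b"a different file")
holders = index.devices_with(checksum(b"identical song bytes"))
```

In this sketch, identical uploads from different computing devices produce the same checksum, so a single lookup identifies every device that can supply the requested data file.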

At step 625, the server 110 retrieves the copy of the remaining portions of the data file from one of the computing devices. At step 630, the server 110 generates an entire copy of the requested data file using the portions retrieved from the identified computing device and the storage system 105. The server 110 can use various file joining techniques for generating a file using various portions of the file. At step 645, the server 110 serves the copy of the data file to the user.
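
By way of a non-limiting illustration, one simple file joining technique for step 630 may be sketched in Python as follows. Representing each portion as a block keyed by its position in the file, and preferring the block held by the storage system when both sources supply the same position, are assumptions made for this example.

```python
from typing import Dict


def join_portions(
    stored: Dict[int, bytes], remote: Dict[int, bytes], total_blocks: int
) -> bytes:
    """Rebuild a data file from blocks keyed by their position in the file.

    `stored` holds blocks from the storage system and `remote` holds blocks
    retrieved from a computing device.
    """
    merged = {**remote, **stored}  # storage-system blocks win on overlap
    missing = [i for i in range(total_blocks) if i not in merged]
    if missing:
        raise ValueError(f"missing blocks: {missing}")
    return b"".join(merged[i] for i in range(total_blocks))


# The storage system holds the first block; a device supplies the rest.
rebuilt = join_portions({0: b"AB"}, {1: b"CD", 2: b"EF"}, total_blocks=3)
```

The sketch raises an error rather than serving an incomplete file when a block cannot be obtained from either source.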

Referring back to step 615, responsive to a determination that the storage system does not have a portion of the requested file, at step 635, the server 110 determines which of the computing devices of other users have a copy of the entire requested data file. At step 640, the server 110 retrieves the copy of the entire data file from one of the computing devices and, at step 645, the server 110 serves the copy of the data file to the user.
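
By way of a non-limiting illustration, the branching of process 600 across steps 610 through 645 may be sketched in Python as follows. The in-memory dictionaries standing in for the storage system and the computing devices, the block-indexed layout, the function name and the choice of the first device holding every needed block are assumptions made for this example.

```python
from typing import Dict

Blocks = Dict[int, bytes]  # block position -> block contents


def serve_data_file(
    file_name: str,
    total_blocks: int,
    storage: Dict[str, Blocks],
    devices: Dict[str, Dict[str, Blocks]],
) -> bytes:
    """Serve a file from storage, from a device, or from both combined."""
    stored = storage.get(file_name, {})
    needed = [i for i in range(total_blocks) if i not in stored]
    if not needed:
        # Steps 610 and 645: the storage system has the entire copy.
        return b"".join(stored[i] for i in range(total_blocks))
    # Steps 620 and 635: find a device holding every block still needed
    # (all blocks when the storage system holds no portion of the file).
    for files in devices.values():
        blocks = files.get(file_name, {})
        if all(i in blocks for i in needed):
            # Steps 625/640 and 630: retrieve the blocks and assemble the file.
            combined = {**{i: blocks[i] for i in needed}, **stored}
            return b"".join(combined[i] for i in range(total_blocks))
    raise LookupError(f"no source available for {file_name!r}")


storage = {"File B": {0: b"b0", 1: b"b1"}, "File C": {0: b"c0"}}
devices = {"305": {"File A": {0: b"a0", 1: b"a1"}, "File C": {0: b"c0", 1: b"c1"}}}
```

In this sketch, "File B" is served entirely from the storage system, "File C" is assembled from one stored block and one device block, and "File A" is served entirely from a computing device. The caller receives the same bytes in each case and cannot tell which source was used.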

Regardless of whether the data file is retrieved from the storage system 105 or from the computing devices of the users, from the perspective of the user who requested the data file, the user sees the data file as being served from the storage system 105. The user may be unaware of the fact that the data file is retrieved from a computing device of another user.

FIG. 7 is a block diagram of a computer system as may be used to implement features of some embodiments of the disclosed technology. The computing system 700 may be used to implement any of the entities, components or services depicted in the examples of FIGS. 1-6 (and any other components described in this specification). The computing system 700 may include one or more central processing units (“processors”) 705, memory 710, input/output devices 725 (e.g., keyboard and pointing devices, display devices), storage devices 720 (e.g., disk drives), and network adapters 730 (e.g., network interfaces) that are connected to an interconnect 715. The interconnect 715 is illustrated as an abstraction that represents any one or more separate physical buses, point to point connections, or both connected by appropriate bridges, adapters, or controllers. The interconnect 715, therefore, may include, for example, a system bus, a Peripheral Component Interconnect (PCI) bus or PCI-Express bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, also called “Firewire”.

The memory 710 and storage devices 720 are computer-readable storage media that may store instructions that implement at least portions of the described technology. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links may be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection. Thus, computer-readable media can include computer-readable storage media (e.g., “non-transitory” media) and computer-readable transmission media.

The instructions stored in memory 710 can be implemented as software and/or firmware to program the processor(s) 705 to carry out actions described above. In some embodiments, such software or firmware may be initially provided to the computing system 700 by downloading it from a remote system through the computing system 700 (e.g., via network adapter 730).

The technology introduced herein can be implemented by, for example, programmable circuitry (e.g., one or more microprocessors) programmed with software and/or firmware, or entirely in special-purpose hardwired (non-programmable) circuitry, or in a combination of such forms. Special-purpose hardwired circuitry may be in the form of, for example, one or more ASICs, PLDs, FPGAs, etc.

REMARKS

The above description and drawings are illustrative and are not to be construed as limiting. Numerous specific details are described to provide a thorough understanding of the disclosure. However, in certain instances, well-known details are not described in order to avoid obscuring the description.

Further, various modifications may be made without deviating from the scope of the invention. Accordingly, the invention is not limited except as by the appended claims.

Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.

The terms used in this specification generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context where each term is used. Certain terms that are used to describe the disclosure are discussed below, or elsewhere in the specification, to provide additional guidance to the practitioner regarding the description of the disclosure. For convenience, certain terms may be highlighted, for example using italics and/or quotation marks. The use of highlighting has no influence on the scope and meaning of a term; the scope and meaning of a term is the same, in the same context, whether or not it is highlighted. It will be appreciated that the same thing can be said in more than one way. One will recognize that “memory” is one form of a “storage” and that the terms may on occasion be used interchangeably.

Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein, nor is any special significance to be placed upon whether or not a term is elaborated or discussed herein. Synonyms for certain terms are provided. A recital of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification including examples of any term discussed herein is illustrative only, and is not intended to further limit the scope and meaning of the disclosure or of any exemplified term. Likewise, the disclosure is not limited to various embodiments given in this specification.

Those skilled in the art will appreciate that the logic illustrated in each of the flow diagrams discussed above may be altered in various ways. For example, the order of the logic may be rearranged, substeps may be performed in parallel, illustrated logic may be omitted, other logic may be included, etc.

Without intent to further limit the scope of the disclosure, examples of instruments, apparatus, methods and their related results according to the embodiments of the present disclosure are given below. Note that titles or subtitles may be used in the examples for convenience of a reader, which in no way should limit the scope of the disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. In the case of conflict, the present document, including definitions, will control.

I/We claim:
 1. A method of regulating data storage, the method comprising: receiving, at a server, a data file from one or more users to generate multiple copies of the data file; storing, by the server, the copies of the data file at a storage system; determining, by the server, a popularity value of the data file, the popularity value indicating a popularity of the data file among the one or more users; determining, by the server, a number of copies of the data file to be stored at the storage system based on the popularity value; and adjusting, by the server, the number of copies of the data file stored at the storage system based on the popularity value.
 2. The method of claim 1, wherein adjusting the number of copies stored at the storage system includes increasing the number of copies stored at the storage system according to a popularity value range the popularity value of the data file corresponds to.
 3. The method of claim 1, wherein adjusting the number of copies stored at the storage system includes decreasing the number of copies stored at the storage system according to a popularity value range the popularity value of the data file corresponds to.
 4. The method of claim 1, wherein adjusting the number of copies stored at the storage system includes not storing any of the copies of the data file at the storage system if the popularity value exceeds a threshold popularity value.
 5. The method of claim 1, wherein receiving the data file from one or more users includes receiving the data file from one or more computing devices associated with each of the one or more users.
 6. The method of claim 1, wherein the copies of the data file include copies of a portion of the data file.
 7. The method of claim 1, wherein determining the popularity value of the data file includes determining the popularity value as a function of a number of computing devices associated with the one or more users that contain the data file.
 8. The method of claim 7, wherein determining the popularity value of the data file includes determining the popularity value as a function of a latency associated with reading the data file from one or more of the computing devices that contain the data file.
 9. The method of claim 7, wherein determining the popularity value of the data file includes determining the popularity value as a function of a network bandwidth available for reading the data file from one or more of the computing devices that contain the data file.
 10. The method of claim 7, wherein determining the popularity value of the data file includes determining the popularity value as a function of availability of a network connection with one or more of the computing devices that contain the data file for reading the data file.
 11. The method of claim 1, wherein determining the popularity value of the data file includes determining the popularity value as a function of a number of the one or more users requiring storage for the same data file at the storage system.
 12. The method of claim 1, wherein determining the popularity value of the data file includes determining the popularity value as a function of access pattern of the data file for a specific user of the one or more users.
 13. The method of claim 1, wherein determining the popularity value of the data file includes determining the popularity value as a function of access pattern of the data file for a subset of the one or more users.
 14. A method comprising: receiving, at a server and from a first computing device associated with a first user, a request to retrieve a first data file of the first user from a storage system, the storage system configured to store a plurality of data files of a plurality of users based on a plurality of popularity values of the corresponding data files; determining, by the server, whether the storage system has the first data file; responsive to a determination that the storage system does not have the copy of the first data file, determining a plurality of computing devices associated with the users that have a copy of the first data file; retrieving, by the server, the copy of the first data file from one of the computing devices; and serving, by the server, the copy of the first data file to the first user.
 15. The method of claim 14, wherein storing the data files of the users in the storage system includes determining, by the server and for a data file of the data files, a popularity value of the data file, the popularity value indicating a popularity of the data file among the users, determining, by the server and based on the popularity value, a number of copies of the data file to be stored at the storage system, and adjusting, by the server, the number of copies of the data file stored at the storage system based on the popularity value.
 16. The method of claim 15, wherein determining the popularity value of the data file includes determining the popularity value as a function of a number of computing devices associated with the users that contain the data file.
 17. The method of claim 16, wherein determining the popularity value of the data file includes determining the popularity value as a function of at least one of: (a) a latency associated with reading the data file from the computing devices that contain the data file, (b) a network bandwidth available for reading the data file from the computing devices that contain the data file or (c) availability of a network connection with the computing devices that contain the data file for reading the data file.
 18. The method of claim 14, wherein adjusting the number of copies stored at the storage system includes at least one of: (a) not storing any of the copies of the data file at the storage system if the popularity value exceeds a first threshold, (b) increasing the number of copies stored at the storage system according to a popularity value range the popularity value of the data file corresponds to or (c) decreasing the number of copies stored at the storage system according to a popularity value range the popularity value of the data file corresponds to.
 19. The method of claim 14, wherein determining a plurality of computing devices associated with the users that have a copy of the first data file includes determining, by the server, a checksum of each of the data files received from the users to generate a plurality of checksums, storing the checksums of the data files and identifications of the computing devices having the copy of the data files at the storage system, and comparing a first checksum of the first data file with the checksums of the data files to determine if any of the computing devices has the copy of the first data file.
 20. An apparatus comprising: a storage system configured to store a plurality of data files received from a plurality of users based on a popularity value of each of the data files; a popularity value determination module to determine the popularity value for each of the data files, the popularity value indicating a popularity of the corresponding data file among the users; and a data file replication management module to determine, based on the popularity value, a number of copies of the corresponding data file to be stored at the storage system, and to adjust, based on the popularity value, the number of copies of the corresponding data file stored at the storage system.
 21. The apparatus of claim 20 further comprising: a request receiving module to receive from a first computing device associated with a first user a request to retrieve a first data file of the first user from the storage system, the first data file being one of the data files; and a data file serving module to determine whether the storage system has a copy of the first data file, responsive to a determination that the storage system does not have the copy of the first data file, determine a plurality of computing devices associated with the users that have the copy of the first data file, retrieve the copy of the first data file from one of the computing devices, and serve the copy of the first data file to the first user.
 22. A method comprising: receiving, at a server and from a first computing device associated with a first user, a request to retrieve a first data file of the first user from a storage system, the storage system configured to store a plurality of data files of a plurality of users based on a plurality of popularity values of the corresponding data files, the storage system configured to store portions of the data files; determining, by the server, whether the storage system has the entire first data file or a portion of the first data file; responsive to a determination that the storage system has the portion of the first data file, determining a plurality of computing devices associated with the users that have a copy of remaining portions of the first data file; retrieving, by the server, the copy of the remaining portions of the first data file from one of the computing devices; and serving, by the server, the copy of the entire first data file to the first user, the entire first data file generated using the portion retrieved from the storage system and the remaining portions retrieved from the one of the computing devices.